Skip to content

Reject QDQ Gemm→QGemm fusion when alpha != 1 with bias#28131

Open
elwhyjay wants to merge 6 commits into
microsoft:mainfrom
elwhyjay:fix/qdq-gemm-alpha-28130
Open

Reject QDQ Gemm→QGemm fusion when alpha != 1 with bias#28131
elwhyjay wants to merge 6 commits into
microsoft:mainfrom
elwhyjay:fix/qdq-gemm-alpha-28130

Conversation

@elwhyjay
Copy link
Copy Markdown
Contributor

@elwhyjay elwhyjay commented Apr 19, 2026

Description

Add an alpha == 1.0 check to GemmNodeGroupSelector::Check (symmetric to the existing beta == 1.0 check) so the QDQ Gemm → QGemm fusion is skipped when a bias is present and alpha != 1. Extend the QDQ Gemm transformer tests (both the default and fastmath variants) with alpha_not_one cases.

Motivation and Context

Fixes #28130.

When a Gemm has alpha != 1 and a bias, the QDQ fusion produces incorrect results. The root cause is in the QGemm kernel:

acc_int32 = (A - zpA)(B - zpB) + C_int        // bias added to int accumulator
Y_float   = (alpha * sa * sb) * acc_int32      // output scale applied to everything

The output scale (alpha * a_scale * b_scale) is applied to the whole int32 accumulator, which already contains the bias, so alpha ends up multiplying the bias too:

Y = alpha * A_dq * B_dq + alpha * C_float      (buggy)
Y = alpha * A_dq * B_dq + C_float              (expected)

This matches the reporter's observation that the discrepancy vanishes when the bias is zero.

This PR is the minimal correctness fix — it rejects the fusion in the broken case rather than producing wrong output. The follow-up discussed on the issue is to absorb alpha into the int32 bias at graph-transform time (C_int_new = round(C_int / alpha) inside GemmReplaceWithQuant), so alpha != 1 cases can keep the fused path. That is a separate PR because the precision/overflow characteristics want their own review.

Test plan

  • Extend QDQTransformerGemmTests and its fastmath variant with an alpha_not_one parameter. Set alpha=2.0 on the Gemm node when true, and adjust check_binary_op_graph so fusion is expected to be skipped when bias is present.
  • On main the new cases fail — both on op count and on output (expected -0.624, got -0.429, diff 0.195, tol 0.016), matching the bug report.
  • With the selector fix, QDQTransformerTests.Gemm_* (8), QDQTransformerTests.* (146), and GraphTransformationTests.Gemm* (16) all pass locally on macOS arm64.
  • End-to-end with the reporter's gemm_alpha.py against a locally built wheel (macOS arm64) matches optimized vs unoptimized across all cases:
    alpha=1.0, optimize=False     [[ 39 221 173]]
    alpha=1.0, optimize=True      [[ 39 221 173]]
    alpha=2.0, optimize=False     [[ 47 216 169]]
    alpha=2.0, optimize=True      [[ 47 216 169]]   (was [[0 255 218]] on main)
    alpha=2.0, optimize=False, bias=0   [[144 118 120]]
    alpha=2.0, optimize=True,  bias=0   [[144 118 120]]

@elwhyjay
Copy link
Copy Markdown
Contributor Author

Friendly ping on this one. @tianleiwu
Most CI is green; the only failure I see is windows_x64_asan, which appears to be an AddressSanitizer OOM in unrelated tests. Happy to rebase/update if preferred.
and would appreciate any guidance on whether this minimal fix is acceptable, or whether maintainers would prefer folding the follow-up bias-rescaling approach into this PR.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes an incorrect QDQ Gemm → QGemm fusion case by preventing fusion when a Gemm bias is present and alpha != 1, and extends the existing QDQ Gemm transformer tests to cover alpha != 1 scenarios (including fastmath).

Changes:

  • Update GemmNodeGroupSelector::Check to reject QDQ Gemm→QGemm fusion for bias-present Gemm when alpha != 1.
  • Add alpha_not_one coverage to QDQTransformerGemmTests in both the standard and fastmath test suites.
  • Update test graph expectations so fusion is skipped specifically for the bias + alpha_not_one combination.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.cc Adds an alpha == 1 requirement (when bias is present) to avoid incorrect bias scaling in QGemm fusion.
onnxruntime/test/optimizer/qdq_transformer_test.cc Extends QDQ Gemm tests with alpha_not_one cases and updates fusion expectations accordingly.
onnxruntime/test/optimizer/qdq_transformer_fastmath_test.cc Mirrors the same alpha_not_one test coverage for the fastmath variant.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

tianleiwu
tianleiwu previously approved these changes May 1, 2026
@tianleiwu tianleiwu enabled auto-merge (squash) May 1, 2026 18:25
@elwhyjay
Copy link
Copy Markdown
Contributor Author

elwhyjay commented May 7, 2026

Hi @tianleiwu, this PR is blocked from auto-merge by a single failing check (windows_x64_asan / build_x64).Could you re-run the ASAN job? Thanks!

@tianleiwu
Copy link
Copy Markdown
Contributor

Hi @tianleiwu, this PR is blocked from auto-merge by a single failing check (windows_x64_asan / build_x64).Could you re-run the ASAN job? Thanks!

Done

@elwhyjay
Copy link
Copy Markdown
Contributor Author

Gentle ping - @tianleiwu. this PR is still blocked by a same single check. Could yo re-run this job again?

@tianleiwu
Copy link
Copy Markdown
Contributor

Gentle ping - @tianleiwu. this PR is still blocked by a same single check. Could yo re-run this job again?

Job re-tried. Hope that it can pass this time.

@elwhyjay
Copy link
Copy Markdown
Contributor Author

@tianleiwu Failed again😅. I think there is a st problem with it 🤔 .

@tianleiwu
Copy link
Copy Markdown
Contributor

tianleiwu commented May 13, 2026

@tianleiwu Failed again😅. I think there is a st problem with it 🤔 .

Last retry (failed 5 times). If it does not work, please merge/rebase main.

The Gemm→QGemm QDQ fusion selector only validated beta == 1, letting
Gemms with alpha != 1 and a bias through. QGemm broadcasts the int32
bias into the accumulator before applying the alpha*sa*sb output scale,
so the bias ends up scaled by alpha too — producing incorrect outputs
when alpha != 1 (bias == 0 masks the issue).

Add an alpha == 1 check alongside the existing beta == 1 check in
GemmNodeGroupSelector::Check (only when bias is present — without bias
the fused path is still exact). Extend QDQTransformerGemmTests and the
fastmath variant with an alpha_not_one parameter so the regression is
covered.

Follow-up tracked in the issue: absorb alpha into the int32 bias in
GemmReplaceWithQuant so alpha != 1 cases can keep the fusion.
auto-merge was automatically disabled May 13, 2026 02:26

Head branch was pushed to by a user without write access

@elwhyjay elwhyjay force-pushed the fix/qdq-gemm-alpha-28130 branch from 184d71d to e8c90e9 Compare May 13, 2026 02:26
@elwhyjay
Copy link
Copy Markdown
Contributor Author

@tianleiwu Sadly failed at the same point. As you suggested, rebased onto current main. Sory for the trouble

@elwhyjay
Copy link
Copy Markdown
Contributor Author

Hi @tianleiwu, sorry to keep bothering you with this. After rebasing to the main, the CI still failed at the same point with the same ASAN OOM, so I suspect the added test coverage might be pushing it over the memory limit. I've reduced the new cases down to the minimal essential ones (keeping the bias + alpha != 1 regression checks). Could you trigger CI once more when you have a moment?

@tianleiwu
Copy link
Copy Markdown
Contributor

@elwhyjay,

Here is AI analysis of the failed CI job:

The failure is most likely caused by TransposeOptimizerTests.TestDequantizeLinearNoAxis, which crashes under ASan with OOM immediately after starting:

Root cause

TestDequantizeLinearNoAxis invokes RunDequantizeLinearTestCase for 6 variants:

  • ONNX domain: uint8_t, uint16_t, int16_t
  • MS domain: uint8_t, uint16_t, int16_t

In RunDequantizeLinearTestCase, all variants use TransformerTester(...) across multiple opsets:

std::vector<int> opsets = {15, 18, 21};
if constexpr (std::is_same_v<QuantType, uint16_t> || std::is_same_v<QuantType, int16_t>) {
  if (q_domain == kOnnxDomain) {
    opsets = std::vector<int>{21};
  }
}

So this single test expands into many ASan-heavy session builds. Since the log shows the crash happens exactly when TestDequantizeLinearNoAxis starts, the issue is test memory pressure, not a product correctness failure.

Best fix

Reduce the matrix for this specific no-axis test under ASan, especially for the 16-bit variants and/or MS-domain duplicates, because the behavior being validated is already covered by neighboring tests:

  • TestDequantizeLinearScalarIgnoreAxis
  • TestDequantizeLinearVector

A focused change is to keep one representative no-axis case and skip the redundant high-memory variants in ASan builds.

Suggested code change

Update TestDequantizeLinearNoAxis to limit coverage under ASan:

TEST(TransposeOptimizerTests, TestDequantizeLinearNoAxis) {
  std::optional<std::vector<int64_t>> zp_input_shape = std::vector<int64_t>{};
  std::vector<int64_t> zp_value_shape{};
  std::optional<ONNX_NAMESPACE::AttributeProto> no_axis;  // Empty axis value will not be set.

  RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);

#if !defined(__SANITIZE_ADDRESS__)
  RunDequantizeLinearTestCase<uint16_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
  RunDequantizeLinearTestCase<int16_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
#if !defined(DISABLE_CONTRIB_OPS)
  RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
  RunDequantizeLinearTestCase<uint16_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
  RunDequantizeLinearTestCase<int16_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
#endif
#endif
}

More portable version for MSVC ASan

Because compiler macros differ on Windows, a safer pattern is to add a small local guard:

#if defined(__SANITIZE_ADDRESS__) || defined(ADDRESS_SANITIZER) || defined(_MSC_VER)
#define ORT_ASAN_BUILD 1
#endif

Then:

TEST(TransposeOptimizerTests, TestDequantizeLinearNoAxis) {
  std::optional<std::vector<int64_t>> zp_input_shape = std::vector<int64_t>{};
  std::vector<int64_t> zp_value_shape{};
  std::optional<ONNX_NAMESPACE::AttributeProto> no_axis;

  RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);

#if !defined(ORT_ASAN_BUILD)
  RunDequantizeLinearTestCase<uint16_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
  RunDequantizeLinearTestCase<int16_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
#if !defined(DISABLE_CONTRIB_OPS)
  RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
  RunDequantizeLinearTestCase<uint16_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
  RunDequantizeLinearTestCase<int16_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
#endif
#endif
}

Preferred cleaner fix

If you want to preserve all coverage outside ASan without sprinkling guards in tests, add an optional opsets override or reduced-coverage mode to the helper and use it only here. Example:

template <typename QuantType>
static void RunDequantizeLinearTestCase(const std::optional<std::vector<int64_t>>& zp_input_shape,
                                        const std::vector<int64_t>& zp_value_shape,
                                        std::optional<ONNX_NAMESPACE::AttributeProto> axis,
                                        const std::string& q_domain = "",
                                        std::vector<int> opsets_override = {}) {
  ...
  std::vector<int> opsets = opsets_override.empty() ? std::vector<int>{15, 18, 21} : std::move(opsets_override);

  if (opsets_override.empty()) {
    if constexpr (std::is_same_v<QuantType, uint16_t> || std::is_same_v<QuantType, int16_t>) {
      if (q_domain == kOnnxDomain) {
        opsets = std::vector<int>{21};
      }
    }
  }

  TransformerTester(..., opsets);
}

Then:

TEST(TransposeOptimizerTests, TestDequantizeLinearNoAxis) {
  ...
#if defined(ORT_ASAN_BUILD)
  RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain, {21});
#else
  ...
#endif
}

Why this is the right solution

The workflow is an ASan job (windows_build_x64_asan), and the log shows the process dies from memory exhaustion rather than a semantic assertion failure. The fix should therefore reduce peak memory from this test’s combinatorial coverage, not change runtime code.

@elwhyjay
Copy link
Copy Markdown
Contributor Author

Hi @tianleiwu, thank you for the analysis! I went with option 1 (__SANITIZE_ADDRESS__ guard) because adding _MSC_VER would also disable the variants on non-ASan Windows builds, and MSVC 16.9+ defines __SANITIZE_ADDRESS__ under /fsanitize=address anyway. Also restored the alpha_not_one cases I had trimmed earlier.

Option 3 changes the helper signature and other call sites, so I left it out of this PR - but I can add it here if you'd prefer, or send it as a small follow-up. Thank you again!

@tianleiwu tianleiwu enabled auto-merge (squash) May 22, 2026 04:07
tianleiwu
tianleiwu previously approved these changes May 22, 2026
@elwhyjay
Copy link
Copy Markdown
Contributor Author

Hi @tianleiwu, the rerun failed again with the same ASan OOM, but the failing test was TransposeOptimizerTests.TestReduceMaxKeepdimsFalse (line 1265), which runs before TestDequantizeLinearNoAxis (line 3760), so the guard never gets a chance to trigger. Looking at the previous failures too, the OOM consistently happens somewhere in the TransposeOptimizerTests block (lines ~1200–1700).
So it looks like the cumulative memory pressure builds up before reaching the guard. Could you advise on how you'd like to proceed?

@tianleiwu
Copy link
Copy Markdown
Contributor

tianleiwu commented May 26, 2026

Hi @tianleiwu, the rerun failed again with the same ASan OOM, but the failing test was TransposeOptimizerTests.TestReduceMaxKeepdimsFalse (line 1265), which runs before TestDequantizeLinearNoAxis (line 3760), so the guard never gets a chance to trigger. Looking at the previous failures too, the OOM consistently happens somewhere in the TransposeOptimizerTests block (lines ~1200–1700). So it looks like the cumulative memory pressure builds up before reaching the guard. Could you advise on how you'd like to proceed?

@elwhyjay, The solution is either reduce memory usage of the test case, or reduce test parallel in ASan to avoid OOM.

Here is a PR for the later: #28675. You can merge it to your branch and test it.

auto-merge was automatically disabled May 27, 2026 01:45

Head branch was pushed to by a user without write access

@elwhyjay
Copy link
Copy Markdown
Contributor Author

Hi @tianleiwu, thank you for #28675! I've merged the latest main (which now includes your --test_parallel 4 change) and also reverted the ASan guard I had added to transpose_optimizer_test.cc since it's no longer needed. Could you trigger CI when you have a moment?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimization breaks QDQ-quantized Gemm graph when alpha != 1

3 participants